Simple Regret Optimization in Online Planning for Markov Decision Processes
Authors
Abstract
We consider online planning in Markov decision processes (MDPs). In online planning, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. The performance of algorithms for online planning is assessed in terms of simple regret, which is the agent's expected performance loss when the chosen action, rather than an optimal one, is followed. To date, state-of-the-art algorithms for online planning in general MDPs are either best effort, or guarantee only polynomial-rate reduction of simple regret over time. Here we introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate reduction of simple regret and error probability. This algorithm is based on a simple yet non-standard state-space sampling scheme, MCTS2e, in which different parts of each sample are dedicated to different exploratory objectives. Our empirical evaluation shows that BRUE not only provides superior performance guarantees, but is also very effective in practice and compares favorably with the state of the art. We then extend BRUE with a variant of "learning by forgetting." The resulting set of algorithms, BRUE(α), generalizes BRUE, improves the exponential factor in the upper bound on its reduction rate, and exhibits even more attractive empirical performance.
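For intuition, the following is a minimal, hypothetical Python sketch of the "different parts of each sample for different objectives" idea behind MCTS2e: each rollout is split at a switching depth, with the segment above the switch used for exploration (uniform action choice) and the segment below used for estimation (following the current greedy policy), and only the node at the switching depth has its estimate updated. The MDP interface (actions, step), the uniform/greedy split, and the running-average update are illustrative assumptions, not the paper's exact BRUE pseudocode.

import random
from collections import defaultdict

def two_phase_rollout(s0, actions, step, Q, counts, switch_depth, horizon):
    """One MCTS2e-style sample (illustrative sketch only): explore uniformly
    down to switch_depth, then follow the current greedy policy to produce a
    return sample, and update only the estimate at the switch node."""
    s = s0
    # Exploration phase: uniform choices decide which node gets refined.
    for _ in range(switch_depth):
        a = random.choice(actions(s))
        s, _ = step(s, a)
    # Estimation phase: probe one action at the switch node, then act greedily.
    switch_state = s
    a_probe = random.choice(actions(s))
    s, r = step(s, a_probe)
    ret = r
    for _ in range(horizon - switch_depth - 1):
        a = max(actions(s), key=lambda b: Q[(s, b)])  # current greedy policy
        s, r = step(s, a)
        ret += r
    # Update only the switch node's estimate, as a running average.
    counts[(switch_state, a_probe)] += 1
    Q[(switch_state, a_probe)] += (ret - Q[(switch_state, a_probe)]) / counts[(switch_state, a_probe)]

def plan(s0, actions, step, horizon, n_samples):
    """Usage sketch: cycle the switching depth across iterations, then
    recommend the greedy action at the root when interrupted."""
    Q, counts = defaultdict(float), defaultdict(int)
    for i in range(n_samples):
        two_phase_rollout(s0, actions, step, Q, counts, i % horizon, horizon)
    return max(actions(s0), key=lambda a: Q[(s0, a)])

The key design point this sketch tries to convey is the separation of concerns: the exploration segment is not biased by current value estimates, which is what allows the probability of recommending a suboptimal root action to shrink at an exponential rate rather than the polynomial rate of estimate-driven exploration.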
Similar Resources
Journal Track Paper Abstracts
We consider online planning in Markov decision processes (MDPs). In online planning, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. Formally, the performance of algorithms for online planning is assessed in terms of ...
The Price of Bandit Information for Online Optimization
In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ R^n in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and in the bandit setting (where only the scala...
Regret optimality in semi-Markov decision processes with an absorbing set
The optimization problem of the general utility case is considered for countable-state semi-Markov decision processes. The regret-utility function is introduced as a function of two variables, one a target value and the other a present value. We consider the expectation of the regret-utility function incurred until the time of reaching a given absorbing set. In order to characterize the regret...
Online stochastic optimization under time constraints
This paper considers online stochastic combinatorial optimization problems where uncertainties, i.e., which requests come and when, are characterized by distributions that can be sampled and where time constraints severely limit the number of offline optimizations which can be performed at decision time and/or in between decisions. It proposes online stochastic algorithms that combine the frame...
Online Regret Bounds for Markov Decision Processes with Deterministic Transitions
We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an...
Journal:
J. Artif. Intell. Res.
Volume 51, Issue -
Pages: -
Publication date: 2014